Big Data–Driven Urban Air Quality Forecasting and Pollution Source Attribution

Authors: Kalaivani Sri R, Akshayaa M, Arthi T, Gayathri V, Kanchana S

DOI Link: https://doi.org/10.22214/ijraset.2026.79077

Abstract

Urban air pollution is among the most pressing environmental and public health concerns. Rapid urban population growth and industrialization have significantly degraded air quality in cities. High concentrations of air pollutants such as particulate matter (PM2.5 and PM10), nitrogen dioxide (NO₂), sulfur dioxide (SO₂), carbon monoxide (CO), and ozone (O₃) have been linked to a range of health hazards, including death. Modern environmental monitoring now generates massive amounts of air quality data from IoT devices, meteorological data sources, and public data sources. This data creates an opportunity to apply Big Data analytics and machine learning algorithms to predict and analyze air quality. Because traditional statistical methods are limited in handling complex relationships and large data volumes, more advanced computational methods are needed. This paper presents a comprehensive Big Data-based framework for air quality prediction and pollution source identification to support air quality management. The proposed system combines air quality and meteorological data and uses machine learning algorithms such as Random Forest, XGBoost, and Long Short-Term Memory (LSTM) to predict the Air Quality Index (AQI).

Introduction

Air pollution is a major global issue caused by urbanization, industrialization, and increased vehicle use. Harmful pollutants like PM2.5, nitrogen oxides, and ozone pose serious risks to human health and the environment. Traditional air quality monitoring methods, based on statistical models, struggle to handle complex, non-linear relationships and large-scale data, and they often fail to identify pollution sources.

To address these challenges, the study proposes a Big Data-driven air quality prediction and source identification system. The system uses a multi-layer architecture including data collection (from IoT sensors and monitoring stations), data processing (using technologies like Hadoop and Spark), machine learning models (Random Forest, XGBoost), and deep learning (LSTM) for accurate forecasting. It also includes a pollution source attribution layer using techniques like correlation and clustering to identify major pollution sources such as traffic and industry.
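The correlation step of the source attribution layer can be illustrated with a minimal sketch. It assumes that each candidate source is represented by a proxy time series (for example, traffic volume or industrial load) and ranks the proxies by how strongly they correlate with a pollutant series. The function names, proxy names, and readings below are hypothetical and are not taken from the paper; clustering is not shown.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)

def rank_sources(pollutant, proxies):
    """Rank candidate sources by absolute correlation with the pollutant series."""
    scores = {name: pearson(pollutant, series) for name, series in proxies.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical hourly readings (illustrative values, not data from the paper).
no2 = [22, 35, 58, 61, 40, 30, 55, 63]
proxies = {
    "traffic_volume": [200, 450, 900, 950, 500, 350, 850, 1000],
    "industrial_load": [70, 72, 69, 71, 70, 73, 68, 71],
}
ranking = rank_sources(no2, proxies)
print(ranking[0][0])  # the proxy most strongly correlated with NO2
```

A high correlation only suggests a candidate source; it does not establish causation, so a ranking like this would normally be cross-checked against meteorology and domain knowledge.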

The methodology involves data collection, preprocessing, feature engineering, model development, prediction, and evaluation. Compared to traditional models, machine learning and deep learning approaches provide higher accuracy and better handling of complex data.
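The preprocessing, feature engineering, and evaluation stages can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: gaps are forward-filled, features are simple lagged values, and a persistence baseline (predict the most recent reading) stands in for the Random Forest, XGBoost, and LSTM models. All function names and AQI values are hypothetical.

```python
from math import sqrt

def forward_fill(series):
    """Replace None gaps with the most recent valid reading.

    Leading None values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def make_lag_rows(series, n_lags):
    """Build (features, target) rows; features are the previous n_lags values."""
    rows = []
    for t in range(n_lags, len(series)):
        rows.append((series[t - n_lags:t], series[t]))
    return rows

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical hourly AQI readings with two sensor dropouts (None).
raw_aqi = [80, 85, None, 95, 100, 110, None, 120, 118, 115]
clean = forward_fill(raw_aqi)
rows = make_lag_rows(clean, n_lags=2)

# Chronological split: a time series is split in order, never shuffled.
split = int(len(rows) * 0.75)
train, test = rows[:split], rows[split:]  # a real model would be fitted on `train`

# Persistence baseline: predict that the next value equals the latest lag.
y_true = [target for _, target in test]
y_pred = [features[-1] for features, _ in test]
print(f"RMSE={rmse(y_true, y_pred):.2f}  MAE={mae(y_true, y_pred):.2f}")
```

A baseline like this gives a reference point: a learned model is only worth its added complexity if it beats the persistence error on the held-out period.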

Results show that XGBoost and LSTM models achieve high prediction accuracy, while Random Forest effectively handles non-linear and missing data. The system also enables real-time monitoring and visualization through dashboards and alerts.
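One possible form of the alerting logic is a rolling-mean threshold check on the forecast stream. The sketch below is an assumption about how such alerts could work, since the source does not describe the rule; the class name, window size, and threshold of 200 are hypothetical.

```python
from collections import deque

class AqiAlerter:
    """Raise an alert when the rolling mean of forecast AQI crosses a threshold."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # keeps only the latest `window` values

    def update(self, forecast_aqi):
        """Add one forecast; return True if the rolling mean exceeds the threshold."""
        self.values.append(forecast_aqi)
        rolling_mean = sum(self.values) / len(self.values)
        return rolling_mean > self.threshold

alerter = AqiAlerter(threshold=200, window=3)  # 200 is an illustrative cutoff
stream = [150, 190, 230, 260, 180]             # hypothetical forecast values
alerts = [alerter.update(v) for v in stream]
print(alerts)  # [False, False, False, True, True]
```

Averaging over a short window keeps a single noisy forecast from triggering an alert, at the cost of a slight delay when pollution rises quickly.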

Overall, the proposed system improves air quality prediction, identifies pollution sources, and supports better environmental decision-making, making it a scalable and effective solution for smart city applications.

Conclusion

This paper proposed a comprehensive Big Data framework for air quality forecasting in urban environments. The framework integrates machine learning models (Random Forest, XGBoost, and LSTM) with Big Data technologies. By handling the complex relationships among environmental factors, it overcomes limitations of conventional air quality forecasting methods. Within the framework, the LSTM model supports accurate forecasting, the XGBoost model yields insights into air quality, and the attribution component identifies pollution sources, which increases the system's overall utility. The framework can therefore support urban air quality management and contribute to sustainability.


Copyright

Copyright © 2026 Kalaivani Sri R, Akshayaa M, Arthi T, Gayathri V, Kanchana S. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Paper Id : IJRASET79077

Publish Date : 2026-03-30

ISSN : 2321-9653

Publisher Name : IJRASET
